Abstract

One of the most popular classes of tests for independence between two random
variables is the general class of rank statistics which are invariant under
permutations. This class contains Spearman's coefficient of rank correlation
statistic, the Fisher-Yates statistic, the weighted Mann statistic and others. Under
the null hypothesis of independence these test statistics have a permutation
distribution, and normal asymptotic theory is usually used to approximate the
p-values for these tests. In this note we suggest using a saddlepoint approach
that is almost exact and needs no extensive simulation to calculate the p-value
for this class of tests.

Some key words: Independence tests; Linear rank test; Permutation distribution;
Saddlepoint approximation.



1 Introduction 

When the factors being studied are not treatments that the investigator can assign
to the subjects but conditions or attributes which are inseparably attached to these
subjects, an assumption that needs to be tested is that an association exists between
the two factors in a population of subjects. Suppose we observe N independent pairs of
random variables $(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)$ and wish to test the null
hypothesis $H$ that the two variables $X_i$ and $Y_i$ are independent for each $i$.
Now rearrange all N pairs of observations according to the magnitude of their
first coordinate into the sequence $(X_{d_1}, Y_{d_1}), (X_{d_2}, Y_{d_2}), \ldots, (X_{d_N}, Y_{d_N})$
in such a way that $X_{d_1} < X_{d_2} < \cdots < X_{d_N}$. Then put $R_i$ equal to the rank of
$Y_{d_i}$ among the observations $Y_{d_1}, Y_{d_2}, \ldots, Y_{d_N}$. Under the assumption of
independence and assuming no ties, all $N!$ orderings $(R_1, \ldots, R_N)$ are equally likely,
each with probability $1/N!$. If we are willing to assume that the two factors have a
positive association, the $\{R_i\}$ should reveal an upward trend, with large values
tending to occur on the right of the sequence and low values on the left. An
appropriate test statistic that reflects this idea is

$$D = \sum_{i=1}^{N} (R_i - i)^2 \qquad (1)$$

with small values of D indicating significance. 

The statistic D is related to the well-known Spearman coefficient of rank
correlation, $S_p$, through the relation $S_p = 1 - 6D/\{N(N^2 - 1)\}$, see Gibbons and
Chakraborti (2003). It is also related to the weighted Mann statistic, $D'$, by
$D' = \frac{1}{6} N(N^2 - 1) - \frac{1}{2} D$.

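Expanding (1) and using $\sum_{i=1}^{N} R_i^2 = \sum_{i=1}^{N} i^2 = N(N+1)(2N+1)/6$, D can be
written as

$$D = \frac{1}{3} N(N+1)(2N+1) - 2 \sum_{i=1}^{N} i R_i$$

which gives an equivalent simple statistic

$$V' = \sum_{i=1}^{N} i R_i \qquad (2)$$

see Hajek, Sidak and Sen (1999).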
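As a quick numerical check of the relations between $D$, $V'$ and $S_p$, the following minimal Python sketch computes the ranks $R_i$ and the three quantities for a small made-up sample. The function names and the toy data are illustrative assumptions and do not come from the original note; ties are assumed absent.

```python
def ranks_after_sorting(x, y):
    """Sort the pairs by x and return the ranks R_i of the matching y values (no ties)."""
    order = sorted(range(len(x)), key=lambda k: x[k])        # indices d_1, ..., d_N
    y_sorted = [y[k] for k in order]                          # Y_{d_1}, ..., Y_{d_N}
    rank_of = {v: r for r, v in enumerate(sorted(y_sorted), start=1)}
    return [rank_of[v] for v in y_sorted]


def spearman_pieces(r):
    """Return D from (1), V' from (2) and Spearman's S_p for ranks r."""
    n = len(r)
    d = sum((ri - i) ** 2 for i, ri in enumerate(r, start=1))
    v = sum(i * ri for i, ri in enumerate(r, start=1))
    s_p = 1 - 6 * d / (n * (n * n - 1))
    return d, v, s_p


x = [2.1, 0.4, 3.3, 1.7, 5.0]    # toy data, purely illustrative
y = [1.0, 0.2, 2.5, 3.1, 4.4]
r = ranks_after_sorting(x, y)
d, v, s_p = spearman_pieces(r)
n = len(r)

# Expanding (1) gives D = N(N+1)(2N+1)/3 - 2 V'; checked here in integer arithmetic.
assert 3 * d == n * (n + 1) * (2 * n + 1) - 6 * v
print(r, d, v, s_p)
```

Because $D$ is a decreasing linear function of $V'$, small values of $D$ correspond to large values of $V'$, which is why upper-tail p-values are used for $V'$.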

The statistic $V'$ belongs to a general class of rank statistics whose null
distributions are invariant under permutations. This class can be written as

$$S = \sum_{i=1}^{N} f_N(i) f_N(R_i) \qquad (3)$$
which contains the Fisher-Yates normal score test with $f_N(i) = E U_N^{(i)}$, where
$U_N^{(1)} < U_N^{(2)} < \cdots < U_N^{(N)}$ is an ordered sample of N observations from the
standard normal distribution; the van der Waerden test statistic, with
$f_N(i) = \Phi^{-1}(i/(N+1))$, where $\Phi$ is the standard normal distribution function; and
the quadrant test statistic, with $f_N(i) = \mathrm{sign}(i - (N+1)/2)$.
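The score functions above are easy to tabulate. The sketch below assumes that the class (3) has the product-score form $\sum_i f_N(i) f_N(R_i)$ and implements the Spearman-type, van der Waerden and quadrant scores. The Fisher-Yates scores are omitted because the expected normal order statistics need numerical integration. All function names are hypothetical.

```python
from statistics import NormalDist


def spearman_score(i, n):
    """f_N(i) = i, which turns the class statistic into sum_i i * R_i."""
    return float(i)


def van_der_waerden_score(i, n):
    """f_N(i) = Phi^{-1}(i / (N + 1))."""
    return NormalDist().inv_cdf(i / (n + 1))


def quadrant_score(i, n):
    """f_N(i) = sign(i - (N + 1) / 2)."""
    c = i - (n + 1) / 2
    return (c > 0) - (c < 0)


def rank_statistic(r, score):
    """Sum over i of f_N(i) * f_N(R_i) for ranks r = (R_1, ..., R_N)."""
    n = len(r)
    return sum(score(i, n) * score(ri, n) for i, ri in enumerate(r, start=1))


r = [1, 4, 2, 3, 5]   # illustrative ranks
for score in (spearman_score, van_der_waerden_score, quadrant_score):
    print(score.__name__, rank_statistic(r, score))
```

Passing the score as a function keeps the statistic itself unchanged when a different member of the class is wanted.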

Saddlepoint approximations to randomization distributions were introduced by
Daniels (1958) and further developed by Robinson (1982) and Davison and Hinkley
(1988). Booth and Butler (1990) showed that various randomization and resampling
distributions are the same as certain conditional distributions and that the double
saddlepoint approximation attains accuracy comparable to the single saddlepoint
approach. Recently, Abd-Elfattah and Butler (2007) used the double saddlepoint
approximation to calculate p-values and confidence intervals for the class of
two-sample linear rank statistics for censored data.

In this note we present a simple, fast and accurate saddlepoint approach that
does not need extensive permutation simulations to approximate the exact p-value
for the previous class of tests, using the double saddlepoint approximation. To use
the double saddlepoint approximation, the following lemma reformulates the class in
a simpler, more convenient form.

Lemma 1 The class of statistics (3) can be written in an equivalent form as

$$V = L^T \sum_{i=1}^{N} f_N(i) Z_i \qquad (4)$$

where $L^T = (f_N(1), f_N(2), \ldots, f_N(N))$, and $Z_1, Z_2, \ldots, Z_N$ are N x 1 vectors of the
form $Z_{R_i} = \eta_i$, $i = 1, \ldots, N$, where the N x N identity matrix is
$I_N = (\eta_1, \eta_2, \ldots, \eta_N)$.

Proof. Simple algebra. ■

For example, if $R_1 = 2$ is the arithmetical rank, then $Z_2 = \eta_1$ and, taking
$f_N(i) = i$, the sum $\sum_{i=1}^{N} i Z_i$ has a 2 in its first component for $R_1$.
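The indicator-vector form in Lemma 1 can be checked numerically. The sketch below assumes the form $V = L^T \sum_k f_N(k) Z_k$ with $L = (f_N(1), \ldots, f_N(N))$ and $Z_{R_i} = \eta_i$; the helper names are hypothetical and the ranks are a toy example with $R_1 = 2$.

```python
def indicator_vectors(r):
    """Return [Z_1, ..., Z_N], where Z_{R_i} is the i-th column of the identity."""
    n = len(r)
    z = [None] * (n + 1)                       # 1-based storage: z[k] is Z_k
    for i, ri in enumerate(r, start=1):
        z[ri] = [1 if row == i else 0 for row in range(1, n + 1)]
    return z[1:]


def lemma_form(r, f):
    """Return (L^T sum_k f(k) Z_k, sum_k f(k) Z_k) with L = (f(1), ..., f(N))."""
    n = len(r)
    z = indicator_vectors(r)
    total = [sum(f(k) * z[k - 1][row] for k in range(1, n + 1)) for row in range(n)]
    value = sum(f(row + 1) * total[row] for row in range(n))
    return value, total


r = [2, 3, 1]                                  # R_1 = 2, so Z_2 = eta_1
value, total = lemma_form(r, lambda k: k)
direct = sum(i * ri for i, ri in enumerate(r, start=1))
assert total[0] == 2                           # first component carries R_1 = 2
assert value == direct
print(total, value)
```

With $f_N(k) = k$ the vector $\sum_k k Z_k$ is just $(R_1, \ldots, R_N)^T$, so the lemma reduces to $\sum_i i R_i$.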
Section 2 presents the saddlepoint approximation approach. A real data example
is illustrated in Section 3 along with a simulation study to show the performance of
the saddlepoint method. The application of the saddlepoint method to the Cuzick (1982)
test statistic in the case of interval censoring is discussed in Section 4.

2 Saddlepoint Approximation for Tests of Independence

Under the null hypothesis H of independence, the permutation distribution of V
places a uniform distribution on the set of N x 1 indicator vectors $\{Z_i\}$. This
distribution may be constructed from a corresponding set of i.i.d. N x 1 vectors of
Multinomial$(1; \theta_1, \theta_2, \ldots, \theta_N)$ indicators $\zeta_1, \zeta_2, \ldots, \zeta_N$. The permutation
distribution over all one-way designs for which $\sum_{i=1}^{N} Z_i = (1, \ldots, 1)^T$ is
constructed from the i.i.d. Multinomial variables as the conditional distribution

$$Z_1, \ldots, Z_N \sim \zeta_1, \ldots, \zeta_N \;\Big|\; \sum_{i=1}^{N} \zeta_i = 1_{N \times 1}.$$

The dependence in the statistic can be removed by using the (N - 1) x 1 vectors
$Z_i^-$ and $\zeta_i^-$, the first N - 1 components of $Z_i$ and $\zeta_i$, so that

$$Z_1^-, \ldots, Z_N^- \sim \zeta_1^-, \ldots, \zeta_N^- \;\Big|\; \sum_{i=1}^{N} \zeta_i^- = (1, \ldots, 1)^T_{(N-1) \times 1}$$

and then V can be represented in terms of $Z_i^-$ as

$$V = L_-^T \sum_{i=1}^{N} f_N(i) Z_i^- + Q$$

where $L_-^T = (f_N(1) - f_N(N), \ldots, f_N(N-1) - f_N(N))$ and $Q = f_N(N) \sum_{i=1}^{N} f_N(i)$.
If $v_0$ is the observed value of V, then the null distribution of V is

$$\Pr\{V \geq v_0\} = \Pr\Big\{ T(\zeta) = L_-^T \sum_{i=1}^{N} f_N(i) \zeta_i^- + Q \geq v_0 \;\Big|\; \sum_{i=1}^{N} \zeta_i^- = (1, \ldots, 1)^T \Big\}$$
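The reduced representation and the permutation p-value it targets can both be checked by brute force for small N. The sketch below assumes product scores, $L_-^T = (f_N(1) - f_N(N), \ldots, f_N(N-1) - f_N(N))$ and $Q = f_N(N) \sum_i f_N(i)$; the function names are hypothetical. Full enumeration is only practical for small N, which is the motivation for an analytic approximation.

```python
from itertools import permutations


def direct_statistic(r, f):
    """Sum over i of f(i) * f(R_i)."""
    return sum(f(i) * f(ri) for i, ri in enumerate(r, start=1))


def reduced_statistic(r, f):
    """L_-^T sum_k f(k) Z_k^- + Q, using only the first N-1 components."""
    n = len(r)
    l_minus = [f(j) - f(n) for j in range(1, n)]
    q = f(n) * sum(f(i) for i in range(1, n + 1))
    # Component j of sum_k f(k) Z_k^- equals f(R_j) for j = 1, ..., N-1.
    total = [f(r[j - 1]) for j in range(1, n)]
    return sum(a * b for a, b in zip(l_minus, total)) + q


def exact_pvalue(f, v0, n):
    """Pr{V >= v0} under the uniform distribution on all N! rank orderings."""
    values = [direct_statistic(p, f) for p in permutations(range(1, n + 1))]
    return sum(v >= v0 for v in values) / len(values)


def identity_score(k):
    return k


# The two forms agree on every permutation of 5 ranks.
assert all(direct_statistic(p, identity_score) == reduced_statistic(p, identity_score)
           for p in permutations(range(1, 6)))
print(exact_pvalue(identity_score, 52, 5))
```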
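Assuming any probability vector $\{\theta_1, \theta_2, \ldots, \theta_N\}$ for the Multinomial distribution,
the conditional distribution of $T(\zeta_1^-, \zeta_2^-, \ldots, \zeta_N^-)$ is the required permutation
distribution, which can be approximated by using the double saddlepoint
approximation of Skovgaard (1987).

The p-value is approximated from the double saddlepoint procedure, which
uses the joint cumulant generating function for $(T(\zeta_1^-, \ldots, \zeta_N^-), \sum_{i=1}^{N} \zeta_i^-)$ given by
$K(s,t) = \log M(s,t)$ where

$$M(s,t) = e^{tQ} \prod_{i=1}^{N} \Big\{ \sum_{j=1}^{N-1} \theta_j \exp(s_j + t r_{ij}) + \theta_N \Big\}$$

with $s = (s_1, \ldots, s_{N-1})$ and $r_{ij} = f_N(i)\{f_N(j) - f_N(N)\}$, and then

$$\Pr(V \geq v_0) \approx 1 - \Phi(\hat{w}) - \phi(\hat{w}) \Big( \frac{1}{\hat{w}} - \frac{1}{\hat{u}} \Big)$$

where

$$\hat{w} = \mathrm{sgn}(\hat{t}) \sqrt{2 \big[ -\{ K(\hat{s},\hat{t}) - \hat{s}^T 1_- - \hat{t} v_0 \} \big]},$$

$$\hat{u} = \hat{t} \sqrt{ |K''(\hat{s},\hat{t})| / |K''_{ss}(0,0)| },$$

and $1_-$ is the (N - 1) x 1 vector of ones. In these expressions, $K''$ is the N x N Hessian
matrix and $K''_{ss}$ is the $\partial^2/\partial s \partial s^T$ portion at (0,0). With $\theta_j = 1/N$, the
saddlepoint $(\hat{s},\hat{t})$ solves

$$K'_t(\hat{s},\hat{t}) = Q + \sum_{i=1}^{N} \frac{\sum_{j=1}^{N-1} r_{ij} \exp(\hat{s}_j + \hat{t} r_{ij})}{\sum_{l=1}^{N-1} \exp(\hat{s}_l + \hat{t} r_{il}) + 1} = v_0,$$

$$K'_{s_j}(\hat{s},\hat{t}) = \sum_{i=1}^{N} \frac{\exp(\hat{s}_j + \hat{t} r_{ij})}{\sum_{l=1}^{N-1} \exp(\hat{s}_l + \hat{t} r_{il}) + 1} = 1, \qquad j = 1, \ldots, N-1.$$

Using $\theta_j = 1/N$, the denominator saddlepoint equations have the explicit solution
$\hat{s}_0 = 0$, for which $K(\hat{s}_0, 0) = 0$; this is why no denominator term appears in
$\hat{w}$, and it simplifies the calculations.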
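The double saddlepoint recipe can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation. It takes $\theta_j = 1/N$, product scores with $r_{ij} = f_N(i)\{f_N(j) - f_N(N)\}$ and $Q = f_N(N) \sum_i f_N(i)$, and includes the factor $e^{tQ}$ in $M(s,t)$. It solves the N saddlepoint equations by a damped Newton iteration started at $(0,0)$, and assumes $v_0$ lies strictly inside the range of V and differs from its null mean. The names `gauss`, `cgf_parts` and `saddlepoint_pvalue` are hypothetical.

```python
import math
from statistics import NormalDist


def gauss(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting; return (x, det a)."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    det = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if p != c:
            m[c], m[p] = m[p], m[c]
            det = -det
        det *= m[c][c]
        for r in range(c + 1, n):
            q = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= q * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x, det


def cgf_parts(f, z):
    """K(s, t), its gradient and Hessian for theta_j = 1/N; z = (s_1, ..., s_{N-1}, t)."""
    n, s, t = len(f), z[:-1], z[-1]
    q = f[-1] * sum(f)                                      # Q = f_N(N) * sum_i f_N(i)
    k_val, grad = t * q, [0.0] * (n - 1) + [q]
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        r = [f[i] * (f[j] - f[-1]) for j in range(n - 1)]   # r_ij
        a = [math.exp(s[j] + t * r[j]) for j in range(n - 1)]
        tot = sum(a) + 1.0
        p = [x / tot for x in a]                            # tilted cell probabilities
        m1 = sum(pj * rj for pj, rj in zip(p, r))
        k_val += math.log(tot / n)
        grad[-1] += m1
        hess[-1][-1] += sum(pj * rj * rj for pj, rj in zip(p, r)) - m1 * m1
        for j in range(n - 1):
            grad[j] += p[j]
            hess[j][j] += p[j]
            hess[j][-1] += p[j] * (r[j] - m1)
            hess[-1][j] += p[j] * (r[j] - m1)
            for k in range(n - 1):
                hess[j][k] -= p[j] * p[k]
    return k_val, grad, hess


def saddlepoint_pvalue(f, v0, iters=30, tol=1e-8):
    """Approximate Pr{V >= v0} by the double saddlepoint (Skovgaard-type) formula."""
    n = len(f)
    target = [1.0] * (n - 1) + [v0]

    def objective(z):
        k_val, grad, hess = cgf_parts(f, z)
        return k_val - sum(z[:-1]) - z[-1] * v0, grad, hess

    z = [0.0] * n                                           # start at (s_0, 0) = (0, 0)
    h, grad, hess = objective(z)
    det0 = gauss([row[:-1] for row in hess[:-1]], [0.0] * (n - 1))[1]   # |K''_ss(0,0)|
    for _ in range(iters):
        g = [gi - ti for gi, ti in zip(grad, target)]
        if max(abs(x) for x in g) < tol:
            break
        step = gauss(hess, [-x for x in g])[0]
        lam = 1.0
        while True:                                         # halve the step until h decreases
            cand = [zi + lam * di for zi, di in zip(z, step)]
            h_new, grad_new, hess_new = objective(cand)
            if h_new <= h or lam < 1e-8:
                break
            lam /= 2.0
        z, h, grad, hess = cand, h_new, grad_new, hess_new
    t_hat = z[-1]
    det1 = gauss(hess, [0.0] * n)[1]                        # |K''(s_hat, t_hat)|
    w = math.copysign(math.sqrt(max(-2.0 * h, 0.0)), t_hat)
    u = t_hat * math.sqrt(det1 / det0)
    nd = NormalDist()
    return 1.0 - nd.cdf(w) - nd.pdf(w) * (1.0 / w - 1.0 / u)


scores = list(range(1, 16))                  # f_N(i) = i with N = 15
p_sad = saddlepoint_pvalue(scores, 1006)     # 1006 = sum_i i * R_i for the Table 1 ranks
print(round(p_sad, 4))
```

Newton's method is only one possible solver and is used here because the objective $K(s,t) - s^T 1_- - t v_0$ is convex. For the Table 1 data the note reports a saddlepoint p-value of 0.2763, so the sketch should land near that value if its assumptions match the authors' setup.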
3 Example and Simulation Study

Nayak (1988) gives the failure times of transmissions (X) and of transmission pumps
(Y) on 15 caterpillar tractors, as shown in Table 1.



X   1641  5556  5421  3168  1534  6367  9460  6679
Y    850  1607  2225  3223  3379  3832  3871  4142

X   6142  5995  3953  6922  4210  5161  4732
Y   4300  4789  6310  6311  6378  6449  6949

Table 1. Failure times of transmissions by Nayak (1988).



To test the independence of the failure times X and Y, the test statistic (2) is
used with $L = (1, \ldots, N)$ and $Q = L_N \sum_{i=1}^{N} i = N^2(N+1)/2$. The true (simulated)
p-value was calculated by using $10^6$ random permutations, with the test statistic
computed for each. The simulated p-value is then the proportion of such permutations
whose statistic exceeds the observed statistic plus the proportion of those equal to
it. The p-value of the saddlepoint approach is compared to the normal p-value
calculated using the standardized test statistic $(v' - E(v'))/\sqrt{\mathrm{Var}(v')}$. The true
p-value and the saddlepoint approximated p-value were 0.2768 and 0.2763,
respectively, while the normal p-value was 0.2693.
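The observed statistic, the normal p-value and a simulated p-value for the Table 1 data can be reproduced with a short sketch. It assumes the pairs are read column by column from Table 1 and uses the standard permutation moments $E(V') = N(N+1)^2/4$ and $\mathrm{Var}(V') = N^2(N+1)^2(N-1)/144$. It draws only 20000 random permutations with a fixed seed, far fewer than the $10^6$ used in the note, so the simulated value is rougher. Variable names are illustrative.

```python
import math
import random
from statistics import NormalDist

x = [1641, 5556, 5421, 3168, 1534, 6367, 9460, 6679,
     6142, 5995, 3953, 6922, 4210, 5161, 4732]
y = [850, 1607, 2225, 3223, 3379, 3832, 3871, 4142,
     4300, 4789, 6310, 6311, 6378, 6449, 6949]


def ranks(values):
    """Ranks 1..N of the values in their original order (no ties assumed)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


n = len(x)
rx, ry = ranks(x), ranks(y)
# Sum of rank products; identical to sum_i i * R_i after sorting the pairs by X.
v_obs = sum(a * b for a, b in zip(rx, ry))

mean = n * (n + 1) ** 2 / 4
var = n ** 2 * (n + 1) ** 2 * (n - 1) / 144
p_normal = 1.0 - NormalDist().cdf((v_obs - mean) / math.sqrt(var))

rng = random.Random(0)                       # fixed seed, for reproducibility only
perm = list(range(1, n + 1))
n_perm, hits = 20000, 0
for _ in range(n_perm):
    rng.shuffle(perm)
    hits += sum(i * r for i, r in enumerate(perm, start=1)) >= v_obs
p_sim = hits / n_perm

print(v_obs, round(p_normal, 4), round(p_sim, 4))
```

The simulated value counts permutations whose statistic is greater than or equal to the observed one, matching the definition of the simulated p-value used in the example.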

A small simulation study was carried out to assess the performance of the
saddlepoint method. Consider the general model of dependence

$$X_i = X_i^* + \lambda \varepsilon_i, \qquad Y_i = Y_i^* + \lambda \varepsilon_i, \qquad i = 1, \ldots, N$$

where all the variables $X_i^*$, $Y_i^*$ and $\varepsilon_i$ are mutually independent and their
distributions do not depend on $i$, and $\lambda$ is a real non-negative parameter. In this
model the null hypothesis H of independence is equivalent to $\lambda = 0$, whereas for
$\lambda > 0$ the variables $X_i$ and $Y_i$ are dependent. Data sets are generated from this
model using Logistic, Extreme value and Uniform distributions for $X_i^*$, $Y_i^*$ and $\varepsilon_i$,
respectively. For each value of $\lambda = 0.0, 0.5$ and each sample size $N = 10, 20, 30$,
1000 data sets are generated and the true, saddlepoint and normal p-values are
calculated using the test statistic (2). Table 2 reports three summaries. "Sad. Prop."
is the proportion of the 1000 data sets for which the saddlepoint p-value was closer
to the true p-value than the normal p-value was, "Abs. Err. Sad." is the average
absolute error of the saddlepoint p-value from the true p-value, and "Rel. Abs. Err.
Sad." is the average relative absolute error of the saddlepoint p-value from the true
p-value.
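One way to generate data from this dependence model is sketched below. The note does not state the parameters of the three distributions, so the standard logistic, the standard Gumbel (maximum-type extreme value) and Uniform(0, 1) used here are assumptions, as are the function names.

```python
import math
import random


def open_unit(rng):
    """Uniform draw on the open interval (0, 1), so the logarithms below are finite."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def simulate_pairs(n, lam, rng):
    """Draw (X_i, Y_i), i = 1..n, with a shared component lam * eps_i."""
    xs, ys = [], []
    for _ in range(n):
        u1, u2 = open_unit(rng), open_unit(rng)
        x_star = math.log(u1 / (1.0 - u1))       # standard logistic (assumed)
        y_star = -math.log(-math.log(u2))        # standard Gumbel (assumed)
        eps = rng.random()                       # Uniform(0, 1) (assumed)
        xs.append(x_star + lam * eps)
        ys.append(y_star + lam * eps)
    return xs, ys


null_x, null_y = simulate_pairs(10, 0.0, random.Random(7))   # lam = 0: independence
dep_x, dep_y = simulate_pairs(10, 0.5, random.Random(7))     # lam = 0.5: dependence
print(len(null_x), len(dep_y))
```

Reusing the same seed for both calls makes the two samples differ only through the shared term, which is convenient when checking the generator.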







 n    $\lambda$    Sad. Prop.    Abs. Err. Sad.    Rel. Abs. Err. Sad.
10    0.0      0.944          0.0010             0.0048
      0.5      0.945          0.0010             0.0050
20    0.0      0.957          0.0003             0.0016
      0.5      0.956          0.0003             0.0018
30    0.0      0.943          0.0003             0.0012
      0.5      0.932          0.0003             0.0013

Table 2. Performance under simulation for the dependence test.

The saddlepoint p-value was more accurate than the normal approximation in 94.6%
of the overall cases. The average absolute saddlepoint error was at most about
$10^{-3}$, with average relative error of at most 0.5% (Table 2). An important
consideration in these saddlepoint computations is the difficulty of solving the N
saddlepoint equations. This becomes increasingly difficult with large N.

4 Discussion 

The problem of testing the independence between two variables under random
censoring has attracted the attention of many authors, see O'Brien (1978), Wei (1980),
Oakes (1982), and Gieser and Randles (1997). When one of the two variables is under
interval censoring, say the first, and the second random variable is observed, Cuzick
(1982) presents a linear log-rank test statistic to test the independence of the two
variables in the form $\sum_{i=1}^{N} \xi_i R_{2i}$, where $\{\xi_i\}$ are given scores and $\{R_{2i}\}$ are the
ranks of the observed values of the second random variable. In the linear form (4),
taking $L = (\xi_1, \xi_2, \ldots, \xi_N)$ and $f_N(R_i) = R_{2i}$, the saddlepoint method is directly
applicable. For example, Cuzick gives survival times for 20 patients for the analysis
of the relation between hemoglobin at presentation and survival in a medical clinic.
The normal p-value using his asymptotic approach was 0.0505, while the true p-value
and the saddlepoint p-value are 0.0516 and 0.0512, respectively.